Abstract

Websites of a particular class form increasingly complex networks,
and new tools are needed to map and understand them. One way of
visualizing such a network is to map it: a map highlights
which members of the community have similar interests, and reveals
the underlying social network. In this paper, we map a network
of websites using Kohonen's self-organizing map (SOM), a
neural-network-like method generally used for clustering and
visualization of complex data sets. The set of websites considered is
the Blogalia weblog hosting site (based at http://www.blogalia.com/),
a thriving community of around 200 members, created in January 2002.
We show how the SOM discovers interesting community features, how it
relates to other community-discovering algorithms, and how it
highlights the set of communities formed over the network.

Keywords: Weblogs, neural networks, self-organizing maps, clustering,
web-based communities, social networks.

1 Introduction and state of the art



Web-based diaries, weblogs (pronounced wee-blogs), or simply blogs [U 121 E],
have become increasingly popular in the last few years. Worldwide, there
could be several million. Non-English weblogs number in the hundreds of
thousands 1. Even though weblogs are sometimes perceived as little more than
post-adolescent rants, they actually are on-screen renderings of communities of
readers/writers, which establish long-running relationships; these communities
include weblog owners/writers or editors, people who post comments on
weblog stories, and silent but persistent readers, who might or
might not have their own weblog. A weblog by itself need not be important,
but as part of a community, its importance cannot be disregarded. All weblogs
in the world can be seen as components of a set of communities, each
one with its own idols, axioms, enemies, and hierarchies. Communities are
not clear-cut, since a particular weblog might belong to several communities
at the same time, even though most weblogs (in fact, all weblogs in the
Spanish-speaking community [3]) are connected to each other by a finite set
of links.

Since blogs perform a sort of collaborative filtering of information published
on the web at large, and are starting to be used as knowledge management
tools, identifying communities becomes especially important. Information
flows more easily within communities than outside them; getting a
message across to as many persons as possible becomes, then, a matter of 
identifying communities, and the position of different sites within them. As 
straightforward as this view of the community concept might seem, the main 
problem is that there is no universally accepted definition of community in 
complex networks. Informally, it can be defined as a set of blogs (or websites) 
that share common interests, but this only begs the definition of common and 
interest. Another possible definition is to consider a community as a set of 
blogs that have a stronger relationship among them than with the rest of the 
websites of the same class. Equating relationship with hyperlinks means that 
a community is a set of weblogs that has more links within the group than to 
outside sites. However, while heavy linking implies belonging to the same
community, the converse does not necessarily hold: two weblogs 2 might both
link to the same one, and thus belong, in a sense, to the same community
without being aware of each other or the community.

1 http://www.blogcensus.net/ keeps a weblog census; English-language weblogs amount
to around one million, and the rest of the world, half a million, at the time of this writing.

2 and its readers/commenters; from now on, every time we refer to weblogs in a
community context, we actually refer to the group of persons related to that weblog: readers,

In practice, the data available to discover community ascription must be
included in the web page source code, which is text formatted using HTML
tags and some additional meta-tags; sometimes, each text can be assigned a
time-stamp. The aforementioned common interest will have to be identified
using this data. From the point of view of text content, two websites
are related if they deal with approximately the same topics. Considering
links, two websites are related if they link to each other in either direction.
These two definitions are actually correlated: Menczer has proved [5] that
pages that link to each other are semantically related. However, there are
several additional problems with communities related by content: if a
community is defined by keywords, then synonyms and hypernyms that are not
considered, or not appropriately chosen, can lead to overlooking certain
websites. This problem is aggravated by the distinct characteristics of
weblogs, which change rapidly and do not focus on a single topic or set of
topics. Using content requires a vector space representation, usually term
frequency/inverse document frequency (TFIDF). This representation usually has
a high dimensionality, much higher than a representation based on links to
other members of the set of websites under study. For a small set of sites, a
link-based representation is much more compact. The relationship expressed by
content distance, moreover, is implicit: two weblogs talking about politics,
for instance, need not know each other, although it is very likely that they
do, since at least the Spanish blogosphere is connected. Moreover, in many
cases, communities are multilingual; two weblogs closely related to each other
(for instance, written by the same author) but written in different languages
(for instance, Spanish and Catalan, or Spanish and English) will be completely
unrelated if only content is taken into account.

Meta-content following protocols such as Friend of a Friend (FOAF, 0E])
could, in principle, also be used as network arcs, but its use is not widespread,
and it represents simply a binary relation (either you are a FOAF or you are
not), while links have some quantitative quality (linking several times is
different from linking only once).

writer(s), commenters, and even those that link to it without even reading it

In this work, links have been chosen over content because they are easily
parsed from the document source; this choice allows for a low-dimensional
representation of each blog, which is represented by a vector with as
many components as there are blogs in the group under study. This obviously only
holds if the number of relevant sites is smaller than the vocabulary needed to 
represent the same sites in a vector space model. It is also univocal: a link 
clearly identifies origin (the weblog it has been found in) and destination 
(from the URL). Links represent a real relationship among the blogs they 
join: they imply that, at least, one has read the other, which shows a kind 
of community relation. This is inferred because communities are created 
by reading, writing about other blogs or commenting on them. It is true 
that there might be other members of the community not uncovered by this 
method (for instance, loyal readers or people who use comments to partici- 
pate); similarly, a member of the community could be linked to another via 
a blog not belonging to the set of blogs under study (Blogalia, in this case);
however, we do not attempt to say the last word about community structure
in the blogosphere (as the set of all weblogs is usually called). Our aim is to
present a method that identifies communities by taking hyperlinks as a good
enough indicator of community relationship.

Content (distance in vector space) or links (number of links, or just the
existence or not of links) are used to create a complex network of the set
of sites under study; consequently, a community must be defined by some
measure that distinguishes, or sets apart, some sites from others. There
are several possible network structures that could be considered communities:
cliques, or sets of sites that link to each other; bipartite cliques, sets of sites
which all link to another, different, set of sites ^U]; k-cores or factions, sets
of sites connected to at least k other sites in the group; or bipartite cores,
which include both the connector and the connected sites. Most of these
structures can be computed and displayed with programs such as Pajek 3 or
UCINET 4, but require some initial parameters such as the number k of links
or the number of cores we want to divide the original set into. All of these
are valid definitions, and can be used in some cases. However, some of them
are restrictive in the sense that they only take into account binary relations,
and not the link weight (number of times it has been used) or direction. In
the case at hand, direction is important: usually, a blog that has been
"pointed to" might not even be aware of it 5. The majority of the concepts
defined above do not create a clear visual image of the community they are
describing.

3 Pajek can be downloaded from http://vlado.fmf.uni-lj.si/pub/networks/pajek/
4 UCINET can be downloaded from http://www.analytictech.com/
5 It is very likely that blog authors are aware of incoming links, and there are tools,
such as http://technorati.com or weblog referrer logs, that allow the author to monitor them

Sometimes, further steps must be taken to infer complex network communities.
Some of them are geared toward specific communities, e.g. communities
expressed via web pages or email messages, like the one we are dealing
with in this paper. Gibson et al. ^T] proposed one of the first algorithms to
infer web communities; it defined a community as a core of central,
authoritative pages linked by hub pages. This definition is a bit fuzzy and
does not provide clear-cut partitions of a set of websites, but it is interesting
in that it was one of the first to recognize the importance of communities
on the web, and to propose an algorithm to define them. Shortly
afterwards, Flake et al. ^2] used a maximum flow/minimum cut algorithm to
define the edges and nodes that act as boundaries between communities.

There exist other algorithms that detect partitions of the original set
according to properties of links, as opposed to properties of nodes. One of
these is the Girvan-Newman algorithm ^3], which detects links that, when
removed, isolate some part of the original set. Clusters, or communities, are
then computed according to where these removed links are. This algorithm
discovers communities quite efficiently, as seen in ^3], but, once again, it
does not discover the internal structure of each community, or the features
that define it.

Recently, Radicchi et al. |E| reviewed existing community definition and
identification methods, claiming that most community definitions are
algorithm-dependent, and proposed a new definition for community discovery that is
independent of the algorithm. Furthermore, they simplified the Girvan-Newman
algorithm by using purely local information to compute edge betweenness.

This paper, along with our previous work 16,, uses Kohonen's
Self-Organizing Map ^7], which is an unsupervised neural-network-like algorithm
that simultaneously performs clustering of input data, and maps it to a
two-dimensional surface. Our objective is to demonstrate how the self-organizing
map discovers underlying community structure efficiently, allows easy
visualization of the complex network, highlights the underlying topic that defines
each community, and permits assigning new websites to a community by
merely looking at their links.

The rest of the paper is organized as follows: first, we give a brief
introduction to Kohonen's self-organizing map in section 2. Section 3
presents the results of applying Kohonen's self-organizing map
to community discovery in Blogalia and, finally, our conclusions
and an outline of future work are presented in section 4.

Figure 1: Fragment of a self-organizing map, composed of 8 neurons
(actually, vectors), arranged in a hexagonal neighborhood. Each circle
labeled with V represents a vector with the same dimensions as the input
vectors in the training set.

Figure 2: Fragment of a self-organizing map with square neighborhood.

2 Kohonen's self-organizing map 

Kohonen originally proposed his self-organizing map inspired by previous
work done by von der Malsburg ^Hj as a model for self-organizing visual
domains in the brain. Kohonen's SOM is composed of a set of n-dimensional
vectors, arranged in a 2-dimensional array. Each vector is surrounded by
6 other vectors in a hexagonal arrangement (see figure 1) or 8 in a
rectangular arrangement (see figure 2). A size n neighborhood of a vector is
defined as the set of other SOM vectors whose indices differ from its own by
less than n.

Kohonen's SOM, like many other heuristic methods, must be trained on
the data it is going to model. Training proceeds as follows:

1. A new vector from the training set (the set of data we want to
model) is chosen randomly.

2. The vector in the SOM closest to it, which will be called the winner, is
computed.

3. All vectors in the neighborhood of the winner are updated so that they
become closer to the input vector by a factor α.

4. Neighborhood size and α are updated.

5. Steps 1 to 4 are repeated until a predetermined number of iterations is
reached.
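
The training steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SOM_PAK implementation: it assumes a rectangular grid with a square ("bubble") neighborhood, random initialization in [0, 1), and a linear decrease of both α and the neighborhood radius; the function and parameter names (train_som, find_winner, alpha0, radius0) are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def find_winner(som, x):
    """Return the (row, col) of the map vector closest to input x."""
    cells = ((r, c) for r in range(len(som)) for c in range(len(som[0])))
    return min(cells, key=lambda rc: squared_distance(som[rc[0]][rc[1]], x))


def train_som(data, rows, cols, iterations, alpha0, radius0, seed=0):
    """Train a rows x cols map on data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: map vectors start at random values in [0, 1).
    som = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
           for _ in range(rows)]
    for t in range(iterations):
        # Step 4 (assumed schedule): alpha and radius shrink linearly.
        remaining = 1.0 - t / iterations
        alpha = alpha0 * remaining
        radius = max(1, round(radius0 * remaining))
        x = rng.choice(data)                     # step 1: random input
        wr, wc = find_winner(som, x)             # step 2: winner
        # Step 3: move every vector in the winner's neighborhood toward x.
        for r in range(max(0, wr - radius), min(rows, wr + radius + 1)):
            for c in range(max(0, wc - radius), min(cols, wc + radius + 1)):
                v = som[r][c]
                for i in range(dim):
                    v[i] += alpha * (x[i] - v[i])
    return som                                   # step 5: stop
```

A call such as train_som(vectors, rows=7, cols=9, iterations=2000, alpha0=0.1, radius0=9) would mirror the shape of a single training period; a second call-like phase with a smaller radius and α would be needed to imitate the usual two-period schedule.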

The self-organization in the SOM emerges because different neighborhoods,
not the whole map, are updated every time a new vector is presented,
and because learning proceeds in an unsupervised way. Other than that, the SOM
is similar to other clustering algorithms such as k-means [19], but, in
this case, clusters are also arranged geographically. That is why it is said to
perform a topographical mapping.

Main applications of the self-organizing map are: 

• Visualization: projection from a high-dimensional space to a two-dimensional
map highlights hidden relationships between data set members [20].

• Clustering: unlike other algorithms such as k-means, each cluster will 
be represented by several vectors. 

• Interpolation or function modeling: it is not especially suited for this
purpose, but if each vector v has an assigned value f(v), these values
can be projected on the map, and unknown values deduced from it.
This is especially useful if f(v) is actually a vector, or if there might be
missing information in the input set [21].

• Classification: if the original data set is sorted into several classes, each
map vector can be calibrated with a class, and then used for classification.
Even if it is not as efficient for classification as other neural
net algorithms, the fact that it can handle missing values makes it quite
useful in those cases. Calibration can be achieved in several possible
ways (using, for instance, Bayesian criteria), and additional supervised
training with algorithms such as Learning Vector Quantization [22] can be used
to improve performance.

• Vector quantization: since the map is a model of a data set, its members
can be used to represent that data set; each vector can be quantized
by assigning it to its closest representative in the map.

There are many software packages that implement SOM, such as the SOM
Toolbox for Matlab, or the som package for R, but the most popular is
probably SOM_PAK 6, originally created by Kohonen's own team. This
package includes command-line programs for training and labelling SOMs,
and several tools for visualizing them: sammon, for performing a Sammon
projection of data, and umat, for applying the cluster-discovery UMatrix [23]
algorithm. We will use these programs in this paper.

So far, the Kohonen SOM has been used for such diverse applications as
protein secondary structure prediction [21], information retrieval [22],
image visualization [2H], and algorithm visualization [21]. In this paper we will
take advantage of its capabilities for the discovery of communities within
Blogalia.



3 Mapping weblog communities 

The working set of websites corresponds to weblogs hosted by Blogalia (http://www.blogalia.com/);
it hosts around 200 weblogs, of which only 162 actually link to or are linked from
other weblogs; these are the ones used in our study. All stories, and just
the stories (excluding information in page templates, or dynamic newsfeeds, 
for instance) published in Blogalia up to September 2003 were used for the 
study; there were around eleven thousand, which included around seventeen 
thousand links. Of those, roughly a quarter were links to other members of 
the community; this set of links will be used in this work to try to understand 
the Blogalia community structure. Each weblog is represented by the set of 
output links to other members of Blogalia. Of course, and due to this deci- 
sion, other websites or weblogs are not considered, which means some sites 
closer to some blogs hosted in Blogalia than most of the inhabitants of that 
site might be ignored; however, in this paper, our intention was to discover 
communities within Blogalia, not all communities that included webs hosted 
by Blogalia. 

In this work, each blog is represented by a vector whose components
are the number of times it links to others in Blogalia; if a blog such as
http://fernand0.blogalia.com/ links to http://atalaya.blogalia.com/ 7
7 times, the corresponding element will hold the value 7. Incoming and
outgoing links are considered separately.

6 The program is free and can be downloaded from
http://www.cis.hut.fi/~hynde/lvq/

7 Last and first author's weblogs, respectively

Table 1: Division of Blogalia into factions, as computed by UCINET. The
number of factions was preset to 3. All the blog URLs are in the form:
http://NAME.blogalia.com/, where the name is the string shown here.

Faction   Components

#1        caboclo csbardalladas silly tubo oracle ender pacotilla haztc-cscuchar
          dragon palabrejas jaio-la-espia dibujantc walkyria tscl saliva mp
          bilbao polincsia clforastcro supcriorcs tcrisa simbiosis ljtarrio
          yildclcn quotidianum gargantual oier smith chcwic odisca osito yamato
          canopus cvasivas clio prestige copensar rimero gargantua pcaton aciou
          akin eledhwen gnudista palcofrcak jomaweb pawley cicncial5 daurmith
          jkaranka vcrbascum blogzinc fbenedctti javarm atalaya www rvr fernand0

#2        tannhauscr cucntacucnto qotidianum jarvarm spamzoo russcllbcattic
          demetro humcdadrclativa vcndcll unhombrctranquilo angclina bar bar a
          protoastronomo ocio hunter circulos reval 6cuerdas trunks bontos
          fondoazul guetto gripe acuarioland cacharreando clcctroducndc aire
          ncutrina mayoral miralado ie too yogurtu amscl xdrcus crisei bep
          cothinkhcalth omar pepino cntrclincas sanador cxploracioncs munchi
          borja copensalud planctancvcrland confrontacion blojj metro prucba
          blogomctro

#3        arclnx gofio miatalaya aldor yamisa mcliccrtc latino estilo-005
          gaccosita estilo-007 estilo-006 fco rivicra kcrbcros estilo-004 mikcl
          estilo-05 estilo-001 estilo-003 batiburrillo estilo-002 beta crizoazul
          magufos clcubo profes forward isilicn maiz elda hispamcd cominaii
          sieyin kakasico luiso niorwcn vent anas puttcn cca pipodols jeohen
          cthulhunam rubcnlnx robcrtfernandez mirada csccpticismo neuronal
          cnpclotas hadcz dcsarrollo rivcndcl hronia
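
The link-count representation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the study: the input format (a list of (source, target) blog-name pairs already extracted from the stories) and the function names link_matrix and transpose are hypothetical. Rows of the resulting matrix hold outgoing links and columns hold incoming links.

```python
from collections import Counter


def link_matrix(blogs, links):
    """Build the blog-to-blog link-count matrix.

    blogs: ordered list of blog names in the set under study.
    links: iterable of (source, target) name pairs, one per hyperlink.
    Links to or from sites outside the set are ignored.
    """
    index = {name: i for i, name in enumerate(blogs)}
    counts = Counter((s, t) for s, t in links if s in index and t in index)
    matrix = [[0] * len(blogs) for _ in blogs]
    for (source, target), n in counts.items():
        matrix[index[source]][index[target]] = n
    return matrix


def transpose(matrix):
    """Swap rows and columns: row i then holds the incoming links of blog i."""
    return [list(column) for column in zip(*matrix)]


# Toy example: fernand0 links to atalaya 7 times, as in the text above.
example_blogs = ["fernand0", "atalaya", "pawley"]
example_links = [("fernand0", "atalaya")] * 7 + [("pawley", "pawley")]
outgoing = link_matrix(example_blogs, example_links)
incoming = transpose(outgoing)
```

Each row of outgoing (or of incoming) is then one training vector for the map, with as many components as there are blogs in the set.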

UCINET was used to compute factions, that is, sets of blogs which all
point to the same blogs. Results are shown in Table 1. The number of
factions was preset to 3. In this case, the first faction corresponds roughly to
the densely connected cluster shown in figure ??; the third, to the sparsely
connected group of blogs; and the second, to all the blogs in between.

The same data was analyzed using Kohonen's self-organizing map. The
software used was SOM_PAK version 3.2, with the parameters shown in table
2. The algorithm was run 30 times with the same parameters, but different
random initial conditions.

Table 2: Parameters used to train Kohonen's self-organizing map in this
paper. The algorithm was run 30 times, and the map with the lowest squared
error was chosen. Values chosen for these parameters are more or less
standard: following Kohonen's advice, map shape is rectangular, size as small as
possible, and the length of the training periods is around 10 and 100 times the
size of the training set.

Parameter name                    Value

Neighborhood type                 Hexa
Neighborhood function             Bubble
Map x size                        9
Map y size                        7
First training period: length     2000
    Neighborhood radius           9
    Training constant             0.1
Second training period: length    10000
    Neighborhood radius           1
    Training constant             0.02
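
Choosing the best of several runs, as described above, amounts to keeping the map with the lowest error over the training set. The sketch below assumes that the error is the mean squared distance between each training vector and its closest map vector; the exact measure reported by SOM_PAK may differ, and the names quantization_error and best_map are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantization_error(som, data):
    """Mean squared distance from each data vector to its closest map vector.

    som is a grid (list of rows) of map vectors; data is a list of vectors.
    """
    total = 0.0
    for x in data:
        total += min(squared_distance(v, x) for row in som for v in row)
    return total / len(data)


def best_map(candidate_maps, data):
    """Return the candidate map with the lowest quantization error."""
    return min(candidate_maps, key=lambda som: quantization_error(som, data))
```

With 30 maps trained from different random initial conditions collected in a list, best_map(maps, training_vectors) would return the one kept for analysis.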

From the links array, two different analyses were performed: by rows and
by columns. Rows represent the set of blogs every blog links to, and columns
represent the set of blogs that link to a particular one. That means that
the SOM was applied to blogs represented by incoming and by outgoing links. On
each map, UMatrix analysis was applied: this analysis shows how the
set is clustered, so that natural clusters tend to stand out.
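
The idea behind the UMatrix analysis can be illustrated with a simplified sketch: each map node is given the average distance between its vector and those of its grid neighbors, so that low values mark the inside of a cluster and high values mark a boundary. The code below assumes a rectangular grid with four neighbors per node, whereas the maps in this paper are hexagonal and were processed with SOM_PAK's umat program; the function name u_matrix is hypothetical.

```python
import math


def u_matrix(som):
    """Average Euclidean distance from each map vector to its grid neighbors.

    som is a grid (list of rows) of map vectors. Returns a grid of floats
    with the same number of rows and columns.
    """
    rows, cols = len(som), len(som[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            distances = []
            # Assumption: 4-connected rectangular neighborhood.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    distances.append(math.dist(som[r][c], som[nr][nc]))
            if distances:
                result[r][c] = sum(distances) / len(distances)
    return result
```

Rendering the result as gray levels, with low values light and high values dark, gives the kind of picture in which clusters appear as clear zones separated by dark nodes.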

Different results have been obtained by training with blogs represented by
incoming or outgoing links. In the first case, shown in figure 5, a single
block, containing the most frequently linked-to blogs, stands out. This block
roughly corresponds to the purple core shown in figure 3, and the first faction
shown in table 1. The scenario that uses outgoing links, shown in figure 4,
is a bit more discriminating, but, once again, distinguishes factions and cores
as computed by other methods.

8 The training set is available from the authors, on the condition that, if it is used
for any scientific publication, this paper or others by the same authors dealing with the
same topic are referenced.

Figure 3: UMatrix map obtained from the SOM trained using rows as input,
that is, outgoing links. Clusters correspond to clear zones separated by dark
hexagons.

Figure 4: UMatrix map obtained from the SOM trained using columns as
input, that is, incoming links.

But it would also be interesting to look at what makes blogs cluster
together in a single node, or what they have in common. It would be
cumbersome to look at each and every node, but if we look at a couple of them
(for instance, the southwest corner of figure IH), we obtain the plot shown in
figure 6: most of them have a peak of links to pawley; for instance, elda has
a single link, and it corresponds to that blog; pawley also has many links
to itself, and so on. There are also some other coincidences: a few links to
omar, for instance.

A similar scenario is seen at the remaining nodes: they have many links
to a blog or set of blogs, which makes the Euclidean distance among them
relatively small. That means that the blogs mapped to a single node roughly
correspond to bipartite cliques, that is, sets of nodes whose link pattern
is

To infer communities from this map, a first approach would be
to assign a community to each node, which would yield several dozen
communities out of the original hundreds of websites. This is not satisfactory,
however, for two reasons: first, nodes which are closer in the Kohonen map
might also belong to the same community, and second, some of the blogs
that are mapped to a single node do not actually belong to any community:
the upper left corner, for instance, in figure H] includes all weblogs that do
not link to any other.

Figure 6: UMatrix map obtained from the SOM trained using rows as input,
that is, outgoing links. Clusters correspond to clear zones separated by dark
hexagons. A community has been identified and outlined with a red line.

Consequently, we will have to take a second approach, based on the
usual clustering techniques applied to Kohonen maps postprocessed with
the UMatrix algorithm: clusters are "white" zones surrounded by "black"
boundaries; white zones represent nodes that are close to each other, while
black nodes are far apart from those around them. In this case, a single
community can be appreciated, composed of those nodes that start roughly with
the third row and third column, and end at the next-to-last row (sixth row)
and sixth column. This group of blogs is outlined in figure |H1

From this figure, we can gather, in an approach advocated by [2E]> that
there would be a single large cluster, and then smaller clusters composed of one
or, at most, two nodes (the biggest could be one composed of 4 nodes, right above
the outlined cluster). Since this can only be identified by visual inspection, a
new definition of community cannot be deduced from it, especially in this case,
where there is no clear-cut division into two or more clusters. So we will introduce
a new definition of community as the set of network nodes that fall on the
same node of a self-organizing map. This definition is functional and, besides,
allows assignment of new nodes just by taking into account their links to the
members of the set under study. An additional advantage is that navigation
from one community to another is possible, just by moving from a node to
its neighbors on the Kohonen map. Besides, a single representative for each
community can be extracted from each node on the network.
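
Under this definition, extracting communities from a trained map is a matter of grouping blogs by the map node closest to their link vector. The following is a minimal sketch of that grouping; the data layout (a dictionary from blog name to link vector) and the names best_matching_unit and communities_by_node are assumptions made for illustration.

```python
def best_matching_unit(som, vector):
    """Return the (row, col) of the map vector closest to the given vector."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    cells = ((r, c) for r in range(len(som)) for c in range(len(som[0])))
    return min(cells, key=lambda rc: squared_distance(som[rc[0]][rc[1]], vector))


def communities_by_node(som, blog_vectors):
    """Group blog names by the map node their link vector falls on.

    blog_vectors: dict mapping blog name to its link-count vector.
    Returns a dict mapping (row, col) to a sorted list of blog names.
    """
    groups = {}
    for name, vector in blog_vectors.items():
        groups.setdefault(best_matching_unit(som, vector), []).append(name)
    return {node: sorted(names) for node, names in groups.items()}
```

The map vector stored at each node can then serve as the single representative of the community mapped to it.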

There is indeed some congruence between communities defined this way and
other concepts. In fact, we can represent factions on the Kohonen map
in the following way: since there are three of them, a primary color (red,
green, blue) will be assigned to each of them; from this, each SOM node
will be assigned an RGB color from the percentage of blogs mapped to that
node belonging to each faction. If blogs belonging to just one faction are
mapped to a node, it will have a primary color; if blogs belonging to two
different factions in equal proportions are mapped to a node, the color will
be 50%/50%, for instance, half green, half red. Results of applying this
procedure to the maps are shown in figure 7.
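
The coloring rule just described reduces to computing, for each map node, the fraction of its blogs that belongs to each faction. A small sketch follows; the function name node_color and the encoding of factions as the integers 1, 2 and 3 (red, green and blue, respectively) are assumptions for illustration.

```python
def node_color(factions_on_node):
    """RGB color of a map node from the factions of the blogs mapped to it.

    factions_on_node: list with one faction number (1, 2 or 3) per blog.
    Returns (red, green, blue) as fractions between 0 and 1, or None when
    no blog is mapped to the node (the node is left uncolored).
    """
    if not factions_on_node:
        return None
    total = len(factions_on_node)
    return tuple(factions_on_node.count(f) / total for f in (1, 2, 3))
```

For example, a node holding one blog from faction #1 and one from faction #2 gets equal parts of red and green and no blue.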

This graph shows that faction #1 as computed by UCINET is more or
less coherent, and blogs in that faction are mapped close to each other, occupying
the majority of the map area. Faction #1, likewise, forms the core of this
network, being composed mainly of the strongly connected component of the
network; in other words, the strongly connected component of the network
occupies the biggest area in the self-organizing map.

More information can be extracted from the self-organizing map. Why 
is this layout taken? Why are some blogs in the center, while others occupy 
the periphery or corners of the map? 

To answer this, we have plotted average closeness for each node in figure
8. Apparently, there are some closeness peaks toward the center of the map,
sloping down to the corners, which have a low average closeness. This is
probably the feature that determines layout, although other measures, such
as betweenness centrality or other centrality measures, would have to be
investigated.
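
The per-node average closeness discussed above can be computed along the following lines. The closeness variant used for the figure is not specified in the text, so this sketch assumes one common choice for directed graphs: the number of blogs reachable from a blog divided by the sum of shortest-path lengths to them, and 0 when nothing is reachable. The names closeness and average_closeness_per_node and the input layout are hypothetical.

```python
from collections import deque


def closeness(adjacency, source):
    """Closeness of source in a directed, unweighted graph.

    adjacency: dict mapping a blog name to an iterable of blogs it links to.
    Assumed definition: reachable blogs / sum of shortest-path lengths.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:                      # breadth-first search from source
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0


def average_closeness_per_node(adjacency, node_of_blog):
    """Average closeness of the blogs mapped to each map node.

    node_of_blog: dict mapping blog name to its (row, col) map node.
    """
    values = {}
    for blog, node in node_of_blog.items():
        values.setdefault(node, []).append(closeness(adjacency, blog))
    return {node: sum(v) / len(v) for node, v in values.items()}
```

Nodes absent from the returned dictionary are those with no blog mapped to them.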

Figure 7: Graphing of factions on the Kohonen map trained with outgoing
links. The "red" faction occupies a large part of the map; this red faction
corresponds to faction #1. The other factions are not so clearly arranged
in the map; this probably means that they do not really form a community.
Green corresponds to faction #2, and blue to #3. Nodes with no blog
mapped are left uncolored.

Figure 8: Graphing of closeness on the Kohonen map trained with outgoing
links. Gray level corresponds to the average closeness of blogs falling on a
particular node; the whiter, the higher the average closeness is. The node
with highest average closeness is the one with eledhwen and others. Once
again, nodes with an asterisk do not have any corresponding blog.

There is an additional advantage in using Kohonen's self-organizing map:
besides being able to distinguish among different groups, we can navigate
using them. Since we know that blogs mapped to a single node are close to
each other, and are also close to the blogs mapped to the nodes surrounding
them, we could create a path from one blog to another, or use it as a
recommendation for users or writers of a single blog. Since it works as a
mathematical map, another blog, not belonging to this community, can also
be mapped to it just by taking into account links to the set of blogs already
mapped (or links from them).
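
Both uses mentioned above, placing a new blog on the map and moving to nearby nodes, can be sketched briefly. The code assumes a rectangular grid in which each node has up to 8 neighbors (the maps in this paper are hexagonal), and the names nearest_node and adjacent_nodes are hypothetical.

```python
def nearest_node(som, link_vector):
    """Map node (row, col) whose vector is closest to a blog's link vector."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    cells = ((r, c) for r in range(len(som)) for c in range(len(som[0])))
    return min(cells,
               key=lambda rc: squared_distance(som[rc[0]][rc[1]], link_vector))


def adjacent_nodes(node, rows, cols):
    """Grid neighbors of a node (assumption: rectangular, 8-connected)."""
    r, c = node
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols]
```

A new blog would first be turned into a vector of link counts to the blogs already on the map; the blogs found on nearest_node and on its adjacent_nodes are then candidates for recommendation.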

4 Conclusion 

Web content creation has lately undergone, under the influence of easy
content-management programs such as weblogs, an extraordinary expansion, which,
so far, shows no sign of abating. Interest groups are created spontaneously
among web users, and it is enlightening to study and identify these groups
from the sociological, economic and technological points of view. Since
web-community formation is generally spontaneous, without explicit registration
or enrollment by those who make them up, and since, besides, a particular website
might belong to several communities, one of the first problems posed by their
study is their identification and representation.

In this paper, we give more details on using a technique well known in
the pattern recognition and data mining fields: Kohonen's self-organizing
maps; our approach was originally presented in ^B]. As has been shown in
this paper, communities identified by analyzing self-organizing maps using
UMatrix are on a par with those identified using other techniques, such
as faction analysis or core extraction, with the additional advantage that
community navigation can be achieved by using the map: blogs on the same
node, or adjacent nodes, belong (in a fuzzy sense) to the same community.
The self-organizing map, besides highlighting the different communities and
groups present in the sample, makes a useful visual representation.

The authors of this work intend to continue along one of the following 
lines: 

• Using self-organizing maps to visualize evolution of a set of blogs, and 
the community formation that goes along with it, by mapping different 
stages in its life. 

• Using other algorithms, such as a fuzzy version of Kohonen's
self-organizing map.

• Applying different representations for each blog, using blog content
instead of blog links: for instance, TFIDF (term frequency/inverse
document frequency) or latent semantic analysis.

• Analysis of nodes with no mapped blog. Do they correspond to network 
structural gaps? Can they be used to create new blogs that bridge gaps? 

• Analysis of nodes with mapped blogs. What do they represent? 

• Mapping of complex network measures on the Kohonen map. Can it
be used to predict any of them, or to offer a fast estimate?